Two of the most important data structures in pandas are the Series and the DataFrame.

Series

A pandas Series is essentially an array with an associated index. In other words, it is an associative array, or in Python terms, a fixed-length, ordered dictionary.


In [3]:
from pandas import Series
Series([1,2,3,4,5])
# Note the index on the left hand side when printed


Out[3]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

In [4]:
a = Series([1, 2, 3, 4, 5])
a.values, a.index # Notice how the values of a Series are just an array, while the index is a special Index type


Out[4]:
(array([1, 2, 3, 4, 5]), RangeIndex(start=0, stop=5, step=1))

In [24]:
# By default a Series doesn't have a name, but you can assign it one:
Series([1,2,3,4], name="My Series")


Out[24]:
0    1
1    2
2    3
3    4
Name: My Series, dtype: int64

In [5]:
# You can change the index if you want
indexed_series = Series([1,2,3,4,5], index=['a','b','c','d','e'])
indexed_series


Out[5]:
a    1
b    2
c    3
d    4
e    5
dtype: int64

In [6]:
# You can use the assigned index or a numeric position to access elements
indexed_series['c'], indexed_series[0]
# In other words, a Series is really just an ordered dict


Out[6]:
(3, 1)
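
If you want to be explicit about the kind of lookup, pandas also provides .loc for label-based access and .iloc for position-based access. A minimal sketch (not run here):


In [ ]:
# Label-based vs. position-based access, made explicit
indexed_series.loc['c'], indexed_series.iloc[0]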

In [67]:
# In fact, you can create a Series right from a Python dict
myseries = Series({'Netherlands': 5, 'Belgium': 10, 'France': 21})
myseries


Out[67]:
Belgium        10
France         21
Netherlands     5
dtype: int64

In [68]:
# You can access it by position or by index; note that Belgium comes first because the dictionary keys are sorted
myseries[0], myseries['Netherlands']


Out[68]:
(10, 5)

As with regular numpy arrays, you can create boolean masks on a Series and use them to select certain values, like so:


In [69]:
myseries[myseries % 2 == 1] # keep only the odd values


Out[69]:
France         21
Netherlands     5
dtype: int64
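
Masks can also be combined with & (and) and | (or); each comparison needs its own parentheses. A quick sketch:


In [ ]:
# Keep values that are odd AND greater than 10
myseries[(myseries % 2 == 1) & (myseries > 10)]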

In [71]:
# Sorting by value:
myseries.sort_values()


Out[71]:
Netherlands     5
Belgium        10
France         21
dtype: int64
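
The counterpart of sort_values() is sort_index(), which sorts by the index (the keys) instead. A quick sketch:


In [ ]:
# Sorting by index instead of by value:
myseries.sort_index()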

A useful feature of Series is that operations on them are often smart enough to align values by index. For example, when adding two Series together, the values at matching indices (=keys) are automatically summed, like so:


In [13]:
a_dict = {'Netherlands': 5, 'Belgium': 10, 'France': 21, 'UK': 5}
b_dict = {'Netherlands': 2, 'Belgium': 1, 'France': -9, 'USA': 4}
a = Series(a_dict)
b = Series(b_dict)
a+b


Out[13]:
Belgium        11.0
France         12.0
Netherlands     7.0
UK              NaN
USA             NaN
dtype: float64

This is not supported by plain Python dicts:


In [14]:
a_dict + b_dict # Doesn't work, TypeError


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-0a1691d054e4> in <module>()
----> 1 a_dict + b_dict # Doesn't work, TypeError

TypeError: unsupported operand type(s) for +: 'dict' and 'dict'
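
If you would rather treat a key that is missing from one of the two Series as 0 instead of ending up with NaN, the add() method accepts a fill_value argument. A minimal sketch using the same a and b:


In [ ]:
# Same addition, but missing keys count as 0 instead of producing NaN
a.add(b, fill_value=0)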

DataFrame

Where a Series is like a single associative array, a DataFrame is a rectangular table of data that contains an ordered collection of columns (each of which is a Series).


In [1]:
# python list of dictionaries that all have the same keys
customers = [
    {'name': "John", 'last_name': "Smith", 'age': 43, 'customer_nr': 12345 },
    {'name': "Mary", 'last_name': "Potter", 'age': 25, 'customer_nr': 67889 },
    {'name': "Rose", 'last_name': "Harrison", 'age': 39, 'customer_nr': 23456 },
    {'name': "John", 'last_name': "Ford", 'age': 56, 'customer_nr': 99999 },  
    {'name': "Patrick", 'last_name': "Harrison", 'age': 41, 'customer_nr': 7777 },
]
# print customers

In [2]:
# let's create a dataframe from this
import pandas as pd
customers_df = pd.DataFrame(customers)
print customers_df # As you'll notice, pandas does pretty printing!


   age  customer_nr last_name     name
0   43        12345     Smith     John
1   25        67889    Potter     Mary
2   39        23456  Harrison     Rose
3   56        99999      Ford     John
4   41         7777  Harrison  Patrick

In [15]:
import numpy as np
# customers_df.select(lambda x: x.name == "John")
customers_df[customers_df['name'] == "John"]


Out[15]:
   age  customer_nr last_name  name
0   43        12345     Smith  John
3   56        99999      Ford  John

In [18]:
# Alternative way to define the same thing
customers = {
    'name': ["John", "Mary", "Rose", "John", "Patrick"],
    'last_name': ["Smith", "Potter", "Harrison", "Ford", "Harrison"],
    'age': [43,25,39,56,41],
    'customer_nr': [12345, 67889, 23456,99999,7777]
}
customers_df = pd.DataFrame(customers)
print customers_df


   age  customer_nr last_name     name
0   43        12345     Smith     John
1   25        67889    Potter     Mary
2   39        23456  Harrison     Rose
3   56        99999      Ford     John
4   41         7777  Harrison  Patrick

In [47]:
# By default, rows are labeled by numbers, but you can also give them labels, like so
population = {'Nevada': {2001: 2.4, 2002: 2.9},
                 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
pd.DataFrame(population)


Out[47]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
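
You can also pass the row labels explicitly with the index argument when constructing a DataFrame. A sketch, reusing the customers dict from above (the labels here are made up):


In [ ]:
# Explicit row labels via the index argument
pd.DataFrame(customers, index=['c1', 'c2', 'c3', 'c4', 'c5'])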

DataFrame manipulation

Pandas has a bunch of methods on the DataFrame that allow you to easily manipulate the data.


In [33]:
# Accessing a certain column (a Series gets returned), either by key or by attribute name
print customers_df['name'], "\n\n", customers_df.name


0       John
1       Mary
2       Rose
3       John
4    Patrick
Name: name, dtype: object 

0       John
1       Mary
2       Rose
3       John
4    Patrick
Name: name, dtype: object

In [36]:
# To access a certain row, use df.loc:
customers_df.loc[2] # Note that rows are also Series (you can tell from the output because it has a Name and dtype)


Out[36]:
age                  39
customer_nr       23456
last_name      Harrison
name               Rose
Name: 2, dtype: object
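
.loc also accepts a column label, so you can pick out a single cell; .iloc does the same selection by position. A quick sketch:


In [ ]:
# A single cell: row label 2, column 'name'
customers_df.loc[2, 'name']
# The same row selected by position instead of by label:
customers_df.iloc[2]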

In [72]:
# Transpose:
customers_df.T


Out[72]:
                                      0                           1                           2                           3                           4
age                                  43                          25                          39                          56                          41
customer_nr                        1234                        1234                        1234                        1234                        1234
last_name                         Smith                      Potter                    Harrison                        Ford                    Harrison
name                               John                        Mary                        Rose                        John                     Patrick
signup_date  2017-05-25 21:27:58.376248  2017-05-25 21:27:58.376248  2017-05-25 21:27:58.376248  2017-05-25 21:27:58.376248  2017-05-25 21:27:58.376248

In [74]:
# Sort by a certain column:
customers_df.sort_values(by='age') # You can also sort by multiple columns by passing an array


Out[74]:
   age  customer_nr last_name     name                signup_date
1   25         1234    Potter     Mary 2017-05-25 21:27:58.376248
2   39         1234  Harrison     Rose 2017-05-25 21:27:58.376248
4   41         1234  Harrison  Patrick 2017-05-25 21:27:58.376248
0   43         1234     Smith     John 2017-05-25 21:27:58.376248
3   56         1234      Ford     John 2017-05-25 21:27:58.376248
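
Sorting by multiple columns, as mentioned above, is just a matter of passing a list of column names. A quick sketch:


In [ ]:
# Sort by last name first, then by age within equal last names
customers_df.sort_values(by=['last_name', 'age'])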

In [20]:
# Count occurrences
customers_df['name'].value_counts() # there are 2 johns!


Out[20]:
John       2
Mary       1
Patrick    1
Rose       1
Name: name, dtype: int64

In [78]:
# sum() sums the columns that have a dtype that can be summed.
# describe() provides summary statistics for the columns with a suitable dtype.
# There are a lot more of these convenience functions, like mean, median, prod, std, var, min/max, etc.
print customers_df.sum(), "\n\n", customers_df.describe()


age             204
customer_nr    6170
dtype: int64 

             age  customer_nr
count   5.000000          5.0
mean   40.800000       1234.0
std    11.054411          0.0
min    25.000000       1234.0
25%    39.000000       1234.0
50%    41.000000       1234.0
75%    43.000000       1234.0
max    56.000000       1234.0
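
A couple of the other convenience functions mentioned above, applied to a single column. A quick sketch:


In [ ]:
# A few more aggregations, on the age column
customers_df['age'].mean(), customers_df['age'].median(), customers_df['age'].max()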

In [40]:
# Add a new column and assign a value to an existing column
import datetime
customers_df['signup_date'] = datetime.datetime.now()
customers_df['customer_nr'] = 1234 # Set all values to 1234; you can also pass an array to specify each value (see the sketch below)
customers_df


Out[40]:
   age  customer_nr last_name     name                signup_date
0   43         1234     Smith     John 2017-05-25 21:27:58.376248
1   25         1234    Potter     Mary 2017-05-25 21:27:58.376248
2   39         1234  Harrison     Rose 2017-05-25 21:27:58.376248
3   56         1234      Ford     John 2017-05-25 21:27:58.376248
4   41         1234  Harrison  Patrick 2017-05-25 21:27:58.376248
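
Passing an array (or list) instead of a scalar assigns one value per row; it must have as many elements as the DataFrame has rows. A sketch that restores the original customer numbers:


In [ ]:
# One value per row instead of a single scalar for the whole column
customers_df['customer_nr'] = [12345, 67889, 23456, 99999, 7777]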

In [41]:
# Boolean masks: select all customers that are older than 40
# Note the use of the > comparison to build the mask
customers_df[customers_df['age'] > 40]


Out[41]:
   age  customer_nr last_name     name                signup_date
0   43         1234     Smith     John 2017-05-25 21:27:58.376248
3   56         1234      Ford     John 2017-05-25 21:27:58.376248
4   41         1234  Harrison  Patrick 2017-05-25 21:27:58.376248

In [42]:
# You can plot counts (the %matplotlib inline magic below renders plots in the notebook)
%matplotlib inline 
name_counts = customers_df['name'].value_counts()
name_counts.plot(kind='barh') # barh = bar horizontal


Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x115f5c390>

DataFrames and functions


In [54]:
# Let's define a dataframe
numbers = pd.DataFrame({
    'elevation': [123, -23, 456],
    'signal_strength': [-783, 123, 453]
})
numbers


Out[54]:
   elevation  signal_strength
0        123             -783
1        -23              123
2        456              453

In [53]:
# Since a DataFrame is essentially a collection of Series, you can apply numpy functions to it, like so:
import numpy as np
np.abs(numbers)


Out[53]:
   elevation  signal_strength
0        123              783
1         23              123
2        456              453

In [57]:
# Apply a custom function column by column (one Series at a time)
numbers.apply(lambda col_series: col_series + 100)


Out[57]:
   elevation  signal_strength
0        223             -683
1         77              223
2        556              553
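
By default apply() works one column at a time; passing axis=1 applies the function one row at a time instead. A quick sketch:


In [ ]:
# Apply a function row by row instead of column by column
numbers.apply(lambda row_series: row_series.sum(), axis=1)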

In [79]:
# Apply a custom function one element at a time
numbers.applymap(lambda val: 0 if val < 100 else val)


Out[79]:
elevation signal_strength
0 123 0
1 0 123
2 456 453